Abstract

In this work we address the task of segmenting an object into its parts, or semantic 
part segmentation. We start by adapting a state-of-the-art semantic segmentation 
system to this task, and show that a combination of a fully-convolutional Deep 
CNN system coupled with Dense CRF labelling provides excellent results for a 
broad range of object categories. Still, this approach remains agnostic to
high-level constraints between object parts. We introduce such prior information by
means of the Restricted Boltzmann Machine, adapted to our task, and train our
model in a discriminative fashion, as a hidden CRF, demonstrating that prior
information can yield additional improvements. We also investigate the
performance of our approach “in the wild”, without information concerning the
objects’ bounding boxes, using an object detector to guide a multi-scale
segmentation scheme.

We evaluate the performance of our approach on the Penn-Fudan and LFW 
datasets for the tasks of pedestrian parsing and face labelling respectively. We 
show superior performance with respect to competitive methods that have been 
extensively engineered on these benchmarks, as well as realistic qualitative results 
on part segmentation, even for occluded or deformable objects. We also provide 
quantitative and extensive qualitative results on three classes from the PASCAL 
Parts dataset. Finally, we show that our multi-scale segmentation scheme can 
boost accuracy, recovering segmentations for finer parts. 


1 Introduction 


Recently Deep Convolutional Neural Networks (DCNNs) have delivered excellent results in a broad
range of computer vision problems, including but not limited to image classification Krizhevsky
et al. (2012); Sermanet et al. (2014); Simonyan & Zisserman (2014); Szegedy et al. (2014);
Papandreou et al. (2014), semantic segmentation Chen et al. (2014a); Long et al. (2014), object
detection Girshick et al. (2014) and fine-grained categorization Zhang et al. (2014). Given this
broad success, DCNNs seem to have become the method of choice for any image understanding task
where the input-output mapping can be described as a classification problem and the output space
is a set of discrete labels.


In this work we investigate the use of DCNNs to address the problem of semantic part segmentation,
namely segmenting an object into its constituent parts. Part segmentation is an important
subproblem for tasks such as recognition, pose estimation, tracking, or applications that require
accurate segmentation of complex shapes, such as a host of medical applications. For example,
state-of-the-art methods for fine-grained categorization rely on the localization and/or
segmentation of object parts Zhang et al. (2014).

Part segmentation is also interesting from the modeling perspective, as the configurations of parts 
are, for most objects, highly structured. Incorporating prior knowledge about the parts of an object 
lends itself naturally to structured prediction, which aims at training a map whose output space has a 
well defined structure. The key question addressed in this paper is how DCNNs can be combined with
structured output prediction to effectively parse object parts. In this manner, one can combine the
discriminative power of CNNs to identify part positions with prior information about object layout,
to recover from possible failures of the CNN. Integrating DCNNs with structured prediction is not
novel. For example, early work by LeCun et al. used Graph Transformer Networks for parsing
1D lines into digits. However, the combination of DCNNs with models of shape, such as Shape
Boltzmann Machines, or of object parts, such as Deformable Part Models, is recent Tompson et al.
(2014); Schwing & Urtasun (2015); Wan et al. (2014).


This work makes several contributions. First, we show that, by adapting the semantic segmentation
system of Chen et al. (2014a) (Section 3), it is possible to obtain excellent results in part
segmentation. This system uses a dense Conditional Random Field (CRF) applied on top of the output
of a DCNN. This simple and non-specialized combination often outperforms specialized approaches
to part segmentation and localization by a substantial margin. In Section 4 we turn to the problem
of augmenting this system with a statistical model of the shape of the object and its parts. A key
challenge is that the shape of parts is subject to substantial geometric variations, including
potentially a variable number of parts per instance, caused by variations in the object pose. We
model this variability using Restricted Boltzmann Machines (RBMs). These implicitly incorporate
rich distributed mixture models in a representation that is particularly effective at capturing
complex localized variations in shape.


In order to use RBMs with DCNNs in a structured-output prediction formulation, we modify RBMs 
in several ways: first, we use hidden CRF training to estimate the RBM parameters in a
discriminative manner, aiming at maximizing the posterior likelihood of the ground-truth part masks given
the DCNN scores as input. We demonstrate that this can yield an improvement over the raw DCNN 
scores by injecting high-level knowledge about the desired object layout. 

Extensive experimental results in Section 5 confirm the merit of our approach on four different
datasets, while in Section 6 we propose a simple scheme to segment objects in the image domain,
without knowing their bounding boxes. We conclude in Section 7 with a summary of our findings.


2 Related Work 


The layout of object parts (shape, for short) obeys statistical constraints that can be both strict
(e.g. head attached to torso) and diverse (e.g. for hair). As such, accounting for these constraints
requires statistical models that can accommodate multi-modal distributions. Statistical shape models
traditionally used in vision, such as Active Appearance Models Cootes et al. (2001) or Deformable
Part Models Felzenszwalb et al. (2010), need to determine in advance a small, fixed number of
mixtures (e.g. 3 or 6), which may not be sufficient to encompass the variability of shapes due to
viewpoint, rotation, and object deformations.

A common approach in previous works has been to combine appearance features with a shape model
to enforce a valid spatial part structure. In Bo & Fowlkes (2011), the authors compute appearance
and shape features on oversegmentations of cropped pedestrian images from the Penn-Fudan
pedestrian dataset Wang et al. (2007); Bo & Fowlkes (2011). They use color and texture histograms
to model appearance and spatial histograms of segment edge orientations as the shape features. The
label of each superpixel is estimated by comparing appearance and shape features to a library of
exemplar segments. Small segments are sequentially merged into larger ones, and simple constraints
(such as that “head” appears above “upper body” and that “hair” appears above “head”) enforce a
consistent layout of parts in the resulting segmentations.


Multi-modal distributions can be naturally captured through distributed representations Hinton et al.
(1986), which represent data through an assembly of complementary patterns that can be
combined in all possible ways. Restricted Boltzmann Machines (RBMs) Smolensky (1986); Hinton
& Salakhutdinov (2006) provide a probabilistic distributed representation that can be understood
as a discrete counterpart to Factor Analysis (or PCA), while their restricted, bipartite graph
topology makes sampling efficient. Stacking together multiple RBMs into a Deep Boltzmann Machine
(DBM) architecture allows us to build increasingly powerful probabilistic models of data, as
demonstrated for a host of diverse modalities, e.g. in Salakhutdinov & Hinton (2009).


In Eslami et al. (2014) RBMs are thoroughly studied and assessed as models of shape. The authors
additionally introduce the Shape Boltzmann Machine (SBM), a two-layer network that combines
ideas from part-based modelling and DBMs, and show that it is substantially more flexible and
expressive than single-layer RBMs. The same approach was extended to deal with multiple region
labels (parts) in Eslami & Williams (2012) and coupled with a model for part appearances. The
layered architecture of the model allows it to capture both local and global statistics of the part 
shapes and part-based object segmentations, while parameter sharing during training helps avoid 
overfitting despite the small size of the training datasets. 


The discriminative training of RBMs has been pursued in shape modelling by Kae et al. (2013) in
a probabilistic setting and by Yang et al. (2014) in a max-margin setting. We pursue a probabilistic
setting and detail our approach below. Despite small theoretical differences, the major practical
difference between our method and the aforementioned ones is that we do not use any superpixels,
pooled features, or boundary signals, as Kae et al. (2013); Yang et al. (2014) do, but rather
rely entirely on the CNN scores.


3 DCNNs for Semantic Part Segmentation


Deep Convolutional Neural Networks have proven to be particularly successful in “Semantic
Image Segmentation”, the task of pixel-wise labeling of images Sermanet et al. (2014); Long et al.
(2014); Chen et al. (2014a). In this section we adapt the recently introduced, state-of-the-art
DeepLab system Chen et al. (2014a) to our task of semantic part segmentation.


Following Chen et al. (2014a), we adopt the architecture of the state-of-the-art 16-layer
classification network of Simonyan & Zisserman (2014) (VGG-16). We employ it in a fully-convolutional
manner, turning it into a dense feature extractor for semantic image segmentation, as in Sermanet
et al. (2014); Oquab et al. (2014); Long et al. (2014), treating the last fully-connected layers of
the DCNN as 1 x 1 spatial convolution kernels. Similarly to Chen et al. (2014a), we employ linear
interpolation to upsample the class scores of the final network layer by a factor of 8, to the
original image resolution. We learn the DCNN network parameters using training images annotated
with semantic object parts at the pixel level, minimizing the cross-entropy loss averaged over all
image positions with Stochastic Gradient Descent (SGD), initializing network parameters from the
ImageNet-pretrained VGG-16 model of Simonyan & Zisserman (2014).
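The training objective described above can be sketched in a few lines. This is a minimal NumPy illustration rather than the actual training code: the function name and the array layout (height x width x classes) are assumptions made for the example.

```python
import numpy as np

def pixelwise_cross_entropy(scores, labels):
    """Mean cross-entropy over all pixel positions.

    scores: float array (H, W, K) of raw class scores (before the softmax).
    labels: int array (H, W) with values in [0, K).
    Both the function name and the array layout are illustrative assumptions.
    """
    # Numerically stable log-softmax along the class axis.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Log-probability of the ground-truth class at every pixel.
    picked = np.take_along_axis(log_probs, labels[..., None], axis=-1)[..., 0]
    # Average over all image positions.
    return -picked.mean()
```

With uniform scores the loss equals log K, which is a convenient sanity check when wiring up a new part vocabulary.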


The model’s ability to capture low-level information related to region boundaries is enhanced by
employing the fully-connected Conditional Random Field (CRF) of Krähenbühl & Koltun (2011),
exploiting its ability to combine fine edge details with long-range dependencies. This particularly
simple combination is both efficient and effective: the DCNN evaluation runs at 8 frames per second
for a 321 x 321 image on a GPU and CRF inference requires 0.5 seconds on a CPU. Similarly to
Chen et al. (2014a), we set the dense CRF hyperparameters by cross-validation, performing grid
search to find the values that perform best on a small held-out validation set for each task.


In order to simplify the evaluation of the learned networks we fine-tune one network per object
category. The system is thoroughly evaluated in Section 5; qualitative results show that it is
surprisingly effective in segmenting parts even for objects such as horses that exhibit complicated,
articulated deformations. While this DCNN + CRF model is very powerful, it can still make gross
errors in some cases. Such errors could be corrected by introducing knowledge of the layout of
objects, allowing for a better, more principled use of global information. Integrating this
information is the goal of Section 4.
4 Conditional Boltzmann Machines


The aim of this section is to construct a probabilistic model of image segmentations that can capture 
prior information on the layout of an object category. The goal of this model is to complement and 
correct information extracted bottom-up from an image by the DCNN as explained in the previous 
section. 


In order to construct this model, we introduce three types of variables: (i) the output v of the densely- 
computed DCNN that is visible during both training and testing; (ii) the binary latent variables h 
that are hidden during both training and testing; and (iii) the ground-truth segmentation labels y that 
are observed during training and inferred during testing. The latter is a one-hot vector for each pixel,
with $y_{i,k} = 1$ indicating that pixel i takes label k out of a set of K possible choices (the parts
plus background).

The conditional probability P(y, h|v; W) of the labels and hidden variables given the observed
DCNN features is the Boltzmann-Gibbs distribution

$$P(y, h \mid v; W) = \frac{\exp(-E(y, h, v; W))}{\sum_{y,h} \exp(-E(y, h, v; W))} \tag{1}$$


where E(y, h, v; W) is an energy function described below. The posterior probability of the
labelling is obtained by marginalizing the latent variables:

$$P(y \mid v; W) = \sum_{h} P(y, h \mid v; W). \tag{2}$$


The goal is to estimate the parameters W of the energy function during training and to use
P(y|v; W) during testing to drive inference towards more probable segmentations.


Before describing the energy function E(y, h, v; W) in detail, note that (i) the DCNN-based
quantities v are always observed and the model does not describe their distribution; in other words,
we construct a conditional model of y Lafferty et al. (2001); He et al. (2004); (ii) unlike common
CRFs, there are also hidden variables h, which results in a Hidden Conditional Random Field
(HCRF) Quattoni et al. (2007); Murphy (2012); (iii) however, unlike the loopy graphs used in
generic HCRFs, the factor graph in this model is bipartite, which makes block Gibbs sampling
possible (Section 4.1).


Consider now the relationship between the DCNN output v and the pixel label y and recall that v
are obtained from the last layer of the DCNN. The DCNN is trained so that, for a given pixel i,
$v_i$ contains the class posteriors up to the softmax operation:

$$P(y_{i,k} = 1 \mid v) = \frac{\exp(v_{i,k})}{\sum_{k'=1}^{K} \exp(v_{i,k'})} \tag{3}$$


This suggests that $v_{i,k}$ can be used as a bias term for $y_{i,k}$ in the energy model, such that
a larger value of $v_{i,k}$ rewards the assignment $y_{i,k} = 1$. The raw values of v are rescaled
using a set of learnable parameters, which allows auto-calibration during training. The contribution
to the energy term is then:

$$-E_{CNN}(y, v; W) = \sum_{i,k,k'} w^c_{k,k'} \, y_{i,k} \, v_{i,k'}, \tag{4}$$

where the CNN calibration parameters $w^c$ are contained in the overall model parameters W. Note
also that this formulation allows interactions between classes to be learned, as class k' predicted
by the DCNN can vote, through the weight $w^c_{k,k'}$, for class k in the energy.
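As an illustration of this calibration term, the sketch below evaluates the negated CNN energy with NumPy. The function names and the array layout (one row per pixel, calibration matrix indexed as label k by DCNN class k') are assumptions made for the example, not part of the described system.

```python
import numpy as np

def neg_energy_cnn(y, v, w_c):
    """-E_CNN = sum over i, k, k' of w_c[k, k'] * y[i, k] * v[i, k'].

    y:   (N, K) one-hot labels, one row per pixel.
    v:   (N, K) raw DCNN scores.
    w_c: (K, K) calibration weights (assumed layout: rows index the
         label k, columns index the DCNN class k').
    """
    return np.einsum("kl,ik,il->", w_c, y, v)

def calibrated_scores(v, w_c):
    """Bias received by label k at pixel i: sum over k' of w_c[k, k'] * v[i, k']."""
    return v @ w_c.T
```

With an identity calibration matrix each label is biased only by its own DCNN score; off-diagonal entries let the score of one class vote for or against another label.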

We can now write the term linking output and hidden variables, which takes the form of an RBM:

$$-E_{RBM}(y, h; W) = \sum_{i,j,k} y_{i,k} \, w_{ij,k} \, h_j + \sum_{i,k} w^b_{i,k} \, y_{i,k} \tag{5}$$

Note that this does not include any ‘lateral’ connection between the observed variables, or between
the hidden variables (this would correspond to terms in which pairs of the same type of variables
are multiplied). Instead, there are two types of terms. The first type has biases $w^b_{i,k}$ for
each pixel location i and label k, favoring certain labels based on their spatial location only. The
second type expresses the interaction between labels and latent variables through the interaction
weights $w_{ij,k}$. These weights determine the effect that activating the hidden node $h_j$ has on
labelling position i as part k. Activating $h_j$ will simultaneously favor or discourage the
activation of labels at different locations according to the pattern encoded by the weights
$w_{ij,k}$; intuitively, latent variables can in this manner encode segmentation fragments.

The overall energy is obtained as the sum of these two terms:

$$E(y, h, v; W) = E_{CNN}(y, v; W) + E_{RBM}(y, h; W) \tag{6}$$

By aggregating the output variables y, the hidden variables h, and the observable variables v into a
single vector z, the energy above can be rewritten in the form:

$$E(z; W) = z^{T} W z \tag{7}$$

where W is a matrix of interactions.
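Because the hidden units only interact with the labels and not with each other, the sum over h needed for the label posterior can be carried out in closed form. The sketch below illustrates this for the RBM part of the energy (the CNN term does not depend on h and simply factors out). The array names `b` (per-pixel label biases) and `w` (label-hidden interaction weights), their layouts, and the brute-force helper are assumptions introduced for the example.

```python
import itertools
import numpy as np

def neg_energy_rbm(y, h, b, w):
    """-E_RBM for labels y (N, K), hidden units h (J,), biases b (N, K), weights w (N, J, K)."""
    return np.einsum("ik,ijk,j->", y, w, h) + np.sum(b * y)

def log_sum_over_hidden(y, b, w):
    """log of the sum over all h of exp(-E_RBM(y, h)), without enumerating h.

    The sum factorises into one (1 + exp(a_j)) factor per hidden unit,
    where a_j is the total input that unit j receives from the labels.
    """
    a = np.einsum("ik,ijk->j", y, w)
    return np.sum(b * y) + np.logaddexp(0.0, a).sum()

def log_sum_over_hidden_bruteforce(y, b, w):
    """Same quantity by explicit enumeration of the 2**J hidden states (small J only)."""
    n_hidden = w.shape[1]
    terms = [neg_energy_rbm(y, np.array(h, dtype=float), b, w)
             for h in itertools.product([0, 1], repeat=n_hidden)]
    return np.logaddexp.reduce(terms)
```

The closed form costs one pass over the weights, whereas enumeration grows exponentially with the number of hidden units; the brute-force version is only useful as a correctness check on toy sizes.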


4.1 Parameter Estimation for Conditional RBMs 


Given a set of M training examples $X = \{(y^1, v^1), \ldots, (y^M, v^M)\}$, parameter estimation
aims at maximizing the conditional log-likelihood of the ground-truth labels:

$$S(W) = \sum_{m=1}^{M} \log P(y^m \mid v^m; W) = \sum_{m=1}^{M} \log \sum_{h} \frac{\exp(-E(y^m, h, v^m; W))}{Z(v^m)}, \tag{8}$$

$$\text{where } Z(v^m) = \sum_{y,h} \exp(-E(y, h, v^m; W)). \tag{9}$$


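Using the notation of Equation 7, a parameter $W_{k,m}$ connects nodes $z_k$ and $z_m$ that can be
either hidden or visible. The partial derivative of the conditional log-likelihood with respect to
$W_{k,m}$ is given by Murphy (2012):

$$\frac{\partial S}{\partial W_{k,m}} = \sum_{n=1}^{M} \Big[ -\langle z_k z_m \rangle_{P(h \mid y^n, v^n; W)} + \langle z_k z_m \rangle_{P(h, y \mid v^n; W)} \Big] \tag{10}$$

where $\langle \cdot \rangle$ denotes expectation under the indicated distribution and n runs over the
M training examples.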

In order to compute the first term, the y and v components of the z vector are given and one has to
average over the posterior on h to compute the expectation of $z_k z_m$. To do so, one starts with
the CNN scores (v) and the ground-truth segmentation maps (y) and computes the posterior over the
hidden variables h, which can be obtained analytically. One then computes the expectation of the
product of any pair of interacting nodes, also in closed form.
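The closed-form posterior follows from the bipartite structure of the model. As a sketch, writing $w_{ij,k}$ for the label-hidden interaction weights of the RBM term and $\sigma$ for the logistic function (notation assumed here for illustration), the hidden units are conditionally independent given the labels:

```latex
P(h \mid y, v; W) = \prod_{j} P(h_j \mid y), \qquad
P(h_j = 1 \mid y) = \sigma\Big(\sum_{i,k} y_{i,k}\, w_{ij,k}\Big), \qquad
\sigma(x) = \frac{1}{1 + e^{-x}},
```

so that the clamped expectation of a label-hidden pair for training example m is

```latex
\langle y_{i,k}\, h_j \rangle_{P(h \mid y^m, v^m; W)}
  = y^m_{i,k}\; \sigma\Big(\sum_{i',k'} y^m_{i',k'}\, w_{i'j,k'}\Big).
```

The CNN scores v do not appear in the posterior over h because, in the energy above, v interacts only with the labels y.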


In order to compute the second term one needs to consider the joint expectation over segmentations
y and hidden variables h when presented with the CNN scores v. The exact computation of this
term is intractable, so it is instead approximated by Monte Carlo sampling using Contrastive
Divergence (Hinton). Namely, we initialize the state y to $y^m$, perform C = 10 iterations of block
Gibbs sampling over y and h, and use the resulting state as a sample from $P(h, y \mid v^m; W)$.


This training algorithm is identical to RBM training with the difference that the partition function is 
image-dependent, resulting in minor algorithmic modifications. 
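The procedure can be sketched as follows for the label-hidden interaction weights. This is a minimal NumPy illustration under stated assumptions, not the training code used in the experiments: the function names, the array layouts, and the use of hidden-unit probabilities rather than binary samples in the final statistics (a common variance-reduction choice) are all assumptions. The weights here are those appearing in the negated energy, so the returned quantity is an ascent direction for the log-likelihood.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_hidden(y, w, rng):
    """Block-sample all hidden units given labels y (N, K); w has shape (N, J, K)."""
    p = sigmoid(np.einsum("ik,ijk->j", y, w))
    return (rng.random(p.shape) < p).astype(float), p

def sample_labels(h, v, w_c, b, w, rng):
    """Block-sample one label per pixel given hidden units h (J,) and scores v (N, K)."""
    logits = v @ w_c.T + b + np.einsum("ijk,j->ik", w, h)
    logits = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    # Inverse-CDF sampling: one categorical draw per pixel.
    u = rng.random((p.shape[0], 1))
    idx = np.minimum((u > p.cumsum(axis=1)).sum(axis=1), p.shape[1] - 1)
    return np.eye(p.shape[1])[idx]

def cd_gradient_w(y_data, v, w_c, b, w, rng, steps=10):
    """Contrastive-divergence estimate of the log-likelihood gradient w.r.t. w."""
    # Clamped phase: labels fixed to the ground truth, h averaged analytically.
    _, p_h_data = sample_hidden(y_data, w, rng)
    positive = np.einsum("ik,j->ijk", y_data, p_h_data)
    # Free phase: start the chain at the ground truth and run block Gibbs sampling.
    y = y_data
    for _ in range(steps):
        h, _ = sample_hidden(y, w, rng)
        y = sample_labels(h, v, w_c, b, w, rng)
    _, p_h_model = sample_hidden(y, w, rng)
    negative = np.einsum("ik,j->ijk", y, p_h_model)
    return positive - negative
```

The image dependence of the partition function shows up only in `sample_labels`, where the CNN scores of the current image bias the label draw; everything else matches standard RBM training.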


5 Experiments 

We evaluate our method on four datasets (LFW, Penn-Fudan, CUB and PASCAL-Parts) and report
qualitative and quantitative results. We compare the accuracy of our pipeline before and after
refining part boundaries using the fully-connected CRF, and also report on the improvements delivered
by the combination of RBMs with CNNs on three categories (faces, cows, horses). While using the
exact same settings for the network and parameter values described in Section 3, we obtain
state-of-the-art results when comparing to carefully engineered approaches for the individual problems.
Method                    head   upper body   lower body   FG     BG      Average
SBP Bo & Fowlkes (2011)   51.6   72.6         71.6         73.3   81.0    70.3
DDN Luo et al. (2013)     60.2   75.7         73.1         78.4   85.0    74.5
DL Luo et al. (2013)      60.0   76.3         75.6         78.7   86.30   75.4
Ours (CNN)                67.8   77.0         76.0         83.0   85.4    77.8
Ours (CNN+CRF)            64.2   81.5         80.9         84.4   87.3    79.7

(a) Segmentation accuracies on Penn-Fudan.

Method                    Accuracy (SP)
GLOC Kae et al. (2013)    94.95%
Ours (CNN)                96.54%
Ours (CNN+RBM)            96.78%
Ours (CNN+CRF)            96.76%
Ours (CNN+RBM+CRF)        96.97%

(b) LFW (superpixel accuracies).

Table 1: Segmentation accuracies of our system on the Penn-Fudan and LFW datasets.



Figure 1: Left: Pedestrian parsing results on Penn-Fudan dataset. From top to bottom: a) Input 
image, b) SBP Bo & Fowlkes (2011), c) Raw CNN scores, d) CNN+CRF, e) Groundtruth. Right:
Part segmentation results for car, horse and cow on the PASCAL-Parts dataset. From left to right: 
a) Input image, b) Masks from raw CNN scores, c) Masks from CNN+CRF, d) Groundtruth. Best 
seen in color. 


Penn-Fudan pedestrian dataset


The Penn-Fudan dataset Bo & Fowlkes (2011) provides manual segmentations of 170 pedestrians
into head, hair, clothes, arms, legs and shoes/feet. This dataset does not come with a train/test split,
so we had to train our networks on a different dataset. We finetune our network on the Pascal person
category, using all images and corresponding part annotations from Chen et al. (2014b).


A complication is that in PASCAL-Parts clothing is not taken into account when segmenting people:
the only regions are “torso”, “arms”, “legs” and “feet”, whereas in Penn-Fudan the semantic parts
used are “hair”, “face”, “upper clothes”, “arms”, “lower clothes”, “legs” and “shoes/feet”. To
facilitate comparison of the methods, we merge “torso” and “arms” from PASCAL and “upper clothes”
and “arms” from Penn-Fudan into “upper body”; similarly we merge “legs” and “feet” from
PASCAL and “lower clothes”, “legs” and “feet” from Penn-Fudan into “lower body”. Other methods
also report results on these two superregions, making comparison possible. Detailed numbers for
Intersection-over-Union (IOU) for each part are included in Table 1a.
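The label merging and the IOU measure can be sketched as follows. This is an illustrative NumPy snippet; the function names and the integer label ids used in the usage note are hypothetical and do not correspond to the actual annotation files.

```python
import numpy as np

def merge_labels(label_map, lookup):
    """Map fine-grained part ids to merged ids; lookup[old_id] gives the new id."""
    return np.asarray(lookup)[label_map]

def iou_per_class(pred, gt, num_classes):
    """Intersection-over-Union for each class id in [0, num_classes)."""
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        # A class absent from both maps has an undefined IOU.
        ious.append(np.logical_and(p, g).sum() / union if union else float("nan"))
    return ious
```

For instance, with hypothetical ids 0 = background, 1 = torso, 2 = arms, 3 = legs, 4 = feet, the lookup `[0, 1, 1, 2, 2]` collapses a PASCAL-style map into background, upper body and lower body before the IOU is computed.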


Labeled Faces in the Wild 


Figure 2: Face parsing results on LFW. From top to bottom: a) Input image, b) Masks from raw
CNN scores, c) CNN+CRF, d) Groundtruth. CRF sharpens boundaries, especially in the case of
long hair. In the 6th image we see a failure case: our system failed to distinguish the man’s very
short hair from the similar-color head skin.

Labeled Faces in the Wild (LFW) is a dataset containing more than 13000 images of faces collected
from the web. For our purposes, we used the “funneled” version of the dataset, in which images have
been coarsely aligned using a congealing-style joint alignment approach Huang et al. (2007). This
is the subset also used in Kae et al. (2013) and consists of 1500 train, 500 validation and 927 testing
images of faces, and their corresponding superpixel segmentations, with labels for background, hair
(including facial hair) and face. We train our DCNN on the 2000 trainval images and evaluate on
the 927 test images, using superpixel accuracy as in Kae et al. (2013) for the purpose of comparison.
Since our system returns pixelwise labels for each image, we employ a simple scheme to obtain
superpixel labels: for each superpixel we compute a histogram of the pixel labels it contains and
choose the most frequent label as the superpixel label.
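The majority-vote scheme just described can be written compactly. The following NumPy sketch is illustrative; the function name and the assumption that superpixel ids are consecutive integers starting at 0 are ours.

```python
import numpy as np

def superpixel_labels(pixel_labels, superpixels, num_classes):
    """Assign to each superpixel the most frequent pixel label inside it.

    pixel_labels: int array (H, W) with values in [0, num_classes).
    superpixels:  int array (H, W) of superpixel ids, assumed to run from 0 upwards.
    Returns an int array with one label per superpixel id.
    """
    n_sp = int(superpixels.max()) + 1
    # Joint index (superpixel, label) so a single bincount builds all histograms.
    joint = superpixels.ravel() * num_classes + pixel_labels.ravel()
    hist = np.bincount(joint, minlength=n_sp * num_classes).reshape(n_sp, num_classes)
    # Ties are resolved in favour of the smallest label id.
    return hist.argmax(axis=1)
```

The tie-breaking rule is an arbitrary choice of this sketch; the text does not specify how ties are handled.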


Caltech-UCSD Birds-200-2011


CUB-200-2011 Wah et al. (2011) is a dataset for fine-grained recognition that contains over 11000
images of various types of birds. CUB-200-2011 does not contain segmentation masks for parts;
however, Zhang et al. (2014) provide bounding boxes for the whole bird, as well as for
its head and body. In that work, the authors describe a system for detecting object parts under two
different settings: 1) when the object bounding box is given, and 2) when the location of the object
is unknown.

We can assess the performance of our system by converting segmentation masks of bird parts to their
(unique) corresponding bounding boxes. We train our system on the trainval bird part annotations
in the Pascal Parts dataset and use the CUB-200-2011 test set (5793 images) for evaluation. We
consider four parts (head, body, wings and legs) and compare with Zhang et al. (2014), focusing
only on the case where the bounding box is considered to be known. Given the bird’s bounding box,
we compute the segmentation masks of the four parts using our network, and then use the label
masks for head and body to construct bounding boxes. Since our final goal is to convert a
segmentation mask to a bounding box, sharp boundaries are not mandatory and we only utilize the
coarse CNN scores. We measure accuracy in terms of PCP (Percentage of Correctly Localized Parts).
Our simple approach proves effective and outperforms Zhang et al. (2014) by about 2% in detecting
bounding boxes for the bird’s body, raising performance from 79.82% to 81.79%. Part R-CNN is
ahead by about 4% in localizing birds’ heads (68.19% vs. 64.41%), but this is a system
that was specifically trained for this task. Furthermore, R-CNN capitalizes on the large number of
region proposals (typically more than 1000) returned by the Selective Search algorithm Uijlings et al.
(2013). These bottom-up proposals can potentially be a bottleneck when trying to localize small
parts, or when higher IOU is required Zhang et al. (2014). In contrast, our approach yields a single,
high-quality segmentation proposal for each object part, removing the need to score “partness” of
hundreds or thousands of individual regions.
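Converting a part mask to its bounding box amounts to taking the tightest axis-aligned rectangle around the mask. A minimal NumPy sketch follows; the function name and the (x_min, y_min, x_max, y_max) inclusive convention are assumptions of the example.

```python
import numpy as np

def mask_to_bbox(mask):
    """Tightest axis-aligned box around a binary mask.

    Returns (x_min, y_min, x_max, y_max) with inclusive pixel coordinates,
    or None when the mask is empty (the part was not found).
    """
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    if rows.size == 0:
        return None
    return int(cols[0]), int(rows[0]), int(cols[-1]), int(rows[-1])
```

A part mask is obtained from the label map as `label_map == part_id`, so `mask_to_bbox(label_map == head_id)` would give the head box for a hypothetical `head_id`.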


Pascal Parts dataset 


In our last experiment, we evaluate our system on the PASCAL Parts dataset Chen et al. (2014b).
This dataset includes high quality part annotations for the 20 PASCAL object classes (train and
val sets), but was released fairly recently, so there are not many works reporting part segmentation
performance. The only work that we know of is by Lu et al. (2014) on car parsing, but
the authors do not provide quantitative results in the form of some accuracy percentage, making
comparison challenging.
Method     head   neck   torso   legs   tail   BG     Average
CNN        55.0   34.2   52.4    46.8   37.2   76.0   50.3
CNN+CRF    55.4   31.9   53.6    43.4   37.7   77.9   50.0

(a) IOU scores on PASCAL-Parts horse class.

Method     head   torso   legs   tail   BG     Average
CNN        57.6   62.7    38.5   11.8   69.7   48.03
CNN+CRF    60.0   64.8    34.8   9.9    72.4   48.38

(b) IOU scores on PASCAL-Parts cow class.

Method     body   plates   lights   wheels   windows   BG     Average
CNN        73.4   41.7     42.2     66.3     61.0      67.4   58.7
CNN+CRF    75.4   35.8     36.1     64.3     61.8      68.7   57.0

(c) IOU scores on PASCAL-Parts car class.

Method          Val set   Val subset
CNN             77.3      83.7
CNN+RBM         77.7      84.4
CNN+CRF         79.1      85.1
CNN+RBM+CRF     79.2      84.7

(d) PASCAL-cow

Method          Val set   Val subset
CNN             76.6      86.4
CNN+RBM         76.3      86.9
CNN+CRF         77.6      88.1
CNN+RBM+CRF     76.7      87.6

(e) PASCAL-horse

Table 2: IOU scores on PASCAL-Parts.


Nevertheless, we report our own results for horse, cow and car, which could serve as a first baseline.
For each class we train a separate DCNN on the train set annotations (using horizontal flipping to
augment the training dataset), and test on the validation set. Our quantitative results are compiled in
Table 2, while in Figure 1 we show qualitative results.


In Tables 1b, 2d and 2e we report on the relative performance of the CNN-based system compared to the
CNN-RBM combination, as well as the results we obtain when combined with the CRF system. For
the “cow” and “horse” categories we also consider a separate subset of images containing poses of
only moderate variation, to focus on cases that should be tractable for an RBM-based shape prior.


We observe that while the RBM typically yields a moderate improvement in performance over the
CNN, this does not always carry over to the combination of these results with the CRF
post-processing module. This suggests that the CRF stage should also be trained jointly, potentially
along the lines of Kae et al. (2013), which we leave for future work.


6 Multi-scale Semantic segmentation in the wild 


In all the experiments described so far we assume that we have a tight bounding box around the 
object we want to segment. However, knowing the precise location of an object in an image is 
a challenging problem in its own right. In this section we investigate possible ways to relax this 
constraint, by applying our system on the full input image and segmenting object parts “in the 
wild”. There are two ways to attack this task. An obvious approach would be to simply run the 
DCNN on the input image; since the network is fully convolutional, the input can be an image 
of arbitrary height and width. A complication that arises is that our system has been trained using 
examples resized at a canonical scale (321 x 321), whereas an image might contain objects at various 
scales. As a consequence, using a single-scale model will probably fail to capture fine part details of 
objects deviating from its nominal scale. Another approach is to utilize an object detector to obtain
an estimate of the object’s bounding box in the image, resize the cropped box to the canonical
dimensions, and segment the object parts as in the previous sections. The obvious drawback is that
potential errors in detection - recovering a misaligned bounding box or missing an object altogether 
- hinder the final segmentation result. 


We explore a simple way of tackling these issues, by coupling our system with a recent,
state-of-the-art object detector Ren et al. (2015). We focus on the person class from the PASCAL-Parts
dataset, training our part segmentation network on the train set and using val to test our
performance. Our approach consists of the following steps. We start by applying the CNN over the
full image domain at three different scales: the original dimensions and upsampling by factors of
1.5 and 2 (we did not use finer resolutions, due to GPU RAM constraints). We then use Ren et al.
(2015) to obtain a set of region proposals, along with their respective class scores. Unlike
Hariharan et al. (2014), we keep all proposals, omitting any NMS or thresholding steps, and use
their bounding boxes as an indicator for scale selection. We associate each bounding box with its
“optimal scale”, namely the scale at which the bounding box dimensions are closest to the nominal
dimensions of the network input: $s_o = \arg\min_s |h_b - h_N| + |w_b - w_N|$, $h_b, w_b$ being the
box’s height and width at scale s, and $h_N = w_N = 321$ being the nominal scale at which the network
was trained. CNN scores at an image location x are selected from the optimal scale of the box that
contains it; if x is contained in multiple boxes, we use the scale and scores supported by the box
with the highest detector score.
This approach allows us to synthesize a map of part scores, combining patches from finer resolutions
when the object is small, and coarser resolutions when the object is large. At test time, we extract
all ground truth bounding boxes for the person class in the PASCAL-Parts val set and calculate pixel
accuracy within the boxes. As a baseline for comparison we use the naive evaluation of the CNN,
applied on a single scale (original image dimensions). This simple approach boosts performance
from 73.9% to 74.7% without training the network with multi-scale data, even though end-to-end
training could yield further improvements.
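The scale selection and score synthesis can be sketched as below. This is an illustrative NumPy version under several assumptions that the text does not fix: box dimensions are multiplied by the scale before being compared with the nominal size, the per-scale score maps have already been resampled to the original image resolution, pixels outside every box fall back to the original-scale scores, and all function and variable names are hypothetical.

```python
import numpy as np

NOMINAL = 321               # nominal network input size (height = width)
SCALES = (1.0, 1.5, 2.0)    # original resolution and the two upsampling factors

def optimal_scale(box, scales=SCALES, nominal=NOMINAL):
    """Scale at which the box is closest to the nominal input size.

    box: (x0, y0, x1, y1) in original-image pixels, end-exclusive.
    """
    x0, y0, x1, y1 = box
    h, w = y1 - y0, x1 - x0
    costs = [abs(s * h - nominal) + abs(s * w - nominal) for s in scales]
    return scales[int(np.argmin(costs))]

def fuse_scores(score_maps, boxes, det_scores, default_scale=1.0):
    """Compose one score map from per-scale maps using detector boxes.

    score_maps: dict mapping each scale to an (H, W, K) array, all resampled
                to the original image resolution (an assumption of this sketch).
    boxes:      list of (x0, y0, x1, y1) integer boxes.
    det_scores: detector confidence for each box.
    """
    fused = score_maps[default_scale].copy()
    # Paint boxes in increasing detector score, so that in overlapping regions
    # the box with the highest score is written last and therefore wins.
    for idx in np.argsort(det_scores):
        x0, y0, x1, y1 = boxes[idx]
        s = optimal_scale(boxes[idx])
        fused[y0:y1, x0:x1] = score_maps[s][y0:y1, x0:x1]
    return fused
```

Small boxes are thus read from the upsampled passes and large boxes from the original pass, which mirrors the behaviour described above.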



(a) Input image (b) Single resolution (c) Multi-scale scheme (d) Ground truth 


Figure 3: Combining CNN features from multiple scales, we can recover segmentations for finer 
object parts. 


7 Discussion 

In this work we have demonstrated that a simple and generic system for semantic segmentation 
relying on Deep CNNs and Dense CRFs can provide state-of-the-art results in the task of semantic 
part segmentation. We have also explored methods of integrating high-level information through a 
joint discriminative training of the network with a statistical, category-specific shape prior, showing 
that these can act in a complementary manner to the bottom-up information provided by DCNNs. 
We also proposed a simple, yet effective multi-scale scheme for segmentation “in the wild”, guided 
by a fast object detector, that is used both to propose possible object boxes, and select the appropriate 
scale for segmentation in a pyramid of CNN scores. 

In future work we aim to explore the joint training of DCNNs with high-level shape priors in an
end-to-end manner, as well as the practical applications of semantic part segmentation
in detection and fine-grained recognition.